Project Outline

Project Overview

Hypothesis:

SMG6 plays role in mediating nonsense-mediated mRNA decay (NMD). Therefore, I hypothesize that SMG6 KO cells will show upregulation of other genes involved in NMD such as UPF1 as compensation.

Dataset information:

RNA Extraction: NPCs were isolated from E13.5 embryo brains (cortices CTX and ganglionic eminences GE) after crossing \(Smg6^{flox}\) and \(Rosa26-CreER^{T2}\) mice. Kept in neurosphere cultures for 2 days followed by 4-Hydroxytamoxifen (4-OHT) treatment to induce the Smg6 deletion. Total RNA was isolated 6 days after 4-OHT treatment from cells with genotypes

  1. \(Smg6^{flox/flox};Rosa26-CreER^{T2}\) without 4-OHT (ctr)
  2. \(Smg6^{flox/flox}\) treated with 4-OHT (ctr + 4-OHT)
  3. \(Smg6^{flox/flox};Rosa26-CreER^{T2}\) treated with 4-OHT (Smg6-iKO)

Limitations: the samples are from in vitro cell cultures, therefore the generalizability of findings towards living organisms is limited. Also, the samples are mouse cells, which means the implication of the result is not directly related to human.


Organization Layout of files

Figure. SMG6 project organization


GitHub Repository

The GitHub Repository of this project can be found here.

You can find all scripts in the git repository.


Raw Data Processing

Download three metadata files (manually)

Download experiment metadata from NCBI.

Download run accession and ftp links from ENA. PRJNA776901 -> download report -> TSV

Download Fastq files (raw data)

Navigate to working directory:

ssh.curie.pbtech
cd project_scripts
sbatch download_fastq.sh 

SRR16684379 is shown as unavailable on ENA. Instead, I used fasterq-dump SRR16684379 to download this sample.

Trim galore

sbatch run_trim_galore.sh "/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/fastq_raw"

trim_galore:

  • --gzip: compress output files.
  • --illumina: specified because the RNA seq uses illumina platform.

FastQC

sbatch run_fastqc.sh "/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/fastq_raw"
sbatch run_fastqc.sh "/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/fastq_trimmed"

fastqc --extract will uncompress the zipped output files.

copy the html files to local:

cd /Users/jennie2000/Desktop/5004ANGSD/angsd_project/fastqc
scp xil4009@aphrodite.med.cornell.edu:/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/fastq_raw/fastqc/*.html .

MultiQC

mamba activate multiqc
multiqc /athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/fastq_raw/fastqc/*_fastqc/ 

cd /Users/jennie2000/Desktop/5004ANGSD/angsd_project/
scp xil4009@aphrodite.med.cornell.edu:/home/xil4009/project_scripts/multiqc_report.html .

MultiQC results for trimmed fastq:

fastqc_per_base_sequence_quality_plot

fastqc_per_sequence_gc_content_plot

fastqc_sequence_duplication_levels_plot

Download reference & star index

sbatch download_m39.sh 
# will download m39 assembly and annotation, save in /athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/genome
sbatch index_STAR_m39.sh
# STAR INDEX saved in /athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/genome/m39_STARindex

Genome assembly & annotation are download from Ensembl.

Genome assembly: Mus_musculus.GRCm39.dna.primary_assembly.fa.gz

Annotation: Mus_musculus.GRCm39.109.gtf.gz

Alignment - STAR

Run alignment on trimmed fastq files:

sbatch run_star_alignment.sh "/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/fastq_trimmed"
# alignment results saved in /athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/alignment_star

STAR --outSAMtype BAM SortedByCoordinate: outputs sorted bam files.

Alignment QC

sbatch run_alignqc.sh 

qorts QC:

  • --generatePlots: outputs plots.
  • --singleEnded: specified because the RNA seq protocol is single ended.
  • --stranded: specified because the RNA seq protocol is stranded.

Download qorts QC plots to local:

cd /Users/jennie2000/Desktop/5004ANGSD/angsd_project

# this will search the remote path for ".pdf" files
# and then copying them, preserving the directory structure 
rsync -arv --prune-empty-dirs \
  --include="*/" --include="*.pdf" --exclude="*" \
  xil4009@aphrodite.med.cornell.edu:/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/alignment_qc .

The first --include matches the folders, the second the pdf file, then the --exclude prevents downloading everything else.

Read distribution:

While the majority of reads are mapped to unique genes, a number of reads are mapped to intronic or UTR regions, which suggests usage of ribodepletion protocol.

Average gene body coverage:

Gene body coverage is biased towards 3’ end.

Strandedness:

Most reads are mapped to the first strand.

Inspect bam in IGV

  1. Open IGV (download) and use the top-left drop-down Genome menu to load the genome (mm39).
  2. Use the search bar to navigate to the gene ACTB.
  3. Download bam and bai
cd /Users/jennie2000/Desktop/5004ANGSD/angsd_project
scp -r xil4009@aphrodite.med.cornell.edu:/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/alignment_star/SRR16684375.Aligned.sortedByCoord.out.bam .
scp -r xil4009@aphrodite.med.cornell.edu:/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/alignment_star/SRR16684375.Aligned.sortedByCoord.out.bam.bai .

Gene Actb of control SRR16684375

Observations:

The library appears to be stranded, which is consistent with what is stated in the paper. There are a few alignments (blue) to the reverse strand. Some reads are mapped to non-exon regions, indicating ribo-depeletion protocol.

IGV gene coverage

Alignment Rates

mamba activate multiqc
multiqc /athena/angsd/scratch/xil4009/SMG6_RNA_SEQ 

Alignment rates and of 11 samples

Feature Counts

featureCounts documentation

flag -s: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). 0 by default.

parameter -s 2 is used.

sbatch run_feature_counts.sh

Copy featureCounts.txt.summary folder to local:

cd /Users/jennie2000/Desktop/5004ANGSD/angsd_project
scp -r xil4009@aphrodite.med.cornell.edu:/athena/angsd/scratch/xil4009/SMG6_RNA_SEQ/feature_counts .

Import libraries

library(glue)
library(DESeq2)
library(ggplot2); theme_set(theme_bw(base_size = 16)) # for making plots 
library(magrittr) # for "pipe"-like coding in R
library(pheatmap)
library(AnnotationDbi)
library(RColorBrewer)
library(org.Mm.eg.db)
mm <- org.Mm.eg.db
library(EnhancedVolcano)

Normalizing Read Counts

Prepare metadata

If needed, install pandas:

library(reticulate)
py_install("pandas")
import os
import sys
import pandas as pd 

def add_sample_type(row):
    if row['Genotype'] == "Smg6 Flox/Flox; RosaCreER tg/+" and row['TREATMENT'] == "no treatment":
        return 'Control'
    elif row['Genotype'] == "Smg6 Flox/Flox; RosaCreER +/+" and row['TREATMENT'] == "4-OHT":
        return 'Control_4OHT'
    elif row['Genotype'] == "Smg6 Flox/Flox; RosaCreER tg/+" and row['TREATMENT'] == "4-OHT":
        return 'Smg6_iKO'

def sample_type(metadata, output):
    md = pd.read_csv(metadata)
    md['Samplename'] = md.apply(lambda row: add_sample_type(row), axis=1)
    # only keep columns 'Run' and 'Samplename'
    md = md[['Run', 'Samplename']]
    md.to_csv(output, sep=' ', index=False, header=False) 

sample_type("metadata/SraRunTable.txt", "metadata/runid_samplename.txt")

Import Feature Counts

fc <- read.table("feature_counts_strand2/featureCounts.txt", header=TRUE)
metadata <- read.table("metadata/runid_samplename.txt")
# set the first column as row names
row.names(metadata) <- metadata[, 1]
metadata[, 1] <- NULL  # remove the first column
# Sample name formatting
names(fc) <- gsub(".*(SRR)([0-9]+).*", "\\1\\2", names(fc))

col_names <- names(fc)

## add conditions to sample IDs 
for ( n in 1:length(col_names) ) {
  col <- col_names[n]
  # if the col is the ID 
  if (grepl("SRR",col)) {
    # get the condition of the ID
    condition <- metadata[col, 1]
    col_names[n] <- glue("{col}_{condition}")
  }
}

names(fc) <- col_names
fc_summary <- read.table("feature_counts/featureCounts.txt.summary", header=TRUE)
names(fc_summary) <- gsub(".*(SRR)([0-9]+).*", "\\1\\2", names(fc_summary))
fc_summary

DESeq2 object setup

colData: data.frame with all the variables you know about your samples, e.g., experimental condition, the type, and date of sequencing and so on. Its row.names should correspond to the unique sample names.

countData: should contain a matrix of the actual values associated with the genes and samples. Conveniently, this is almost exactly the format of the featureCounts output.

## gene IDs should be stored as row.names
row.names(fc) <- make.names(fc$Geneid)
## exclude the columns without read counts (columns 1 to 6 contain additional ## info such as genomic coordinates)
readcounts <- fc[ , -c(1:6)]
head(readcounts)
sample_info <- data.frame(condition = gsub("SRR[0-9]+_", "", names(readcounts)),
                         row.names = names(readcounts) )
sample_info

Generate the DESeqDataSet

DESeq.ds <- DESeqDataSetFromMatrix(countData = as.matrix(readcounts),
                                   colData = sample_info,
                                   design = ~ condition)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
# How many reads were sequenced for each sample ( = library sizes)
colSums(counts(DESeq.ds)) %>% barplot(las = 2, cex.names = 0.7)

Remove genes with no reads:

dim(DESeq.ds)
## [1] 57010    11
keep_genes <- rowSums(counts(DESeq.ds)) > 0
DESeq.ds <- DESeq.ds[ keep_genes, ]
dim(DESeq.ds)
## [1] 40239    11

Normalizing for sequencing depth and RNA composition differences

Calculating and applying the size factor

DESeq.ds <- estimateSizeFactors(DESeq.ds) # calculate SFs, add them to object 
plot( sizeFactors(DESeq.ds), colSums(counts(DESeq.ds)), # assess them
      ylab = "library sizes", xlab = "size factors", cex = .6 )

Transform

par(mfrow=c(1,2)) # to plot the two box plots next to each other
## bp of non-normalized
boxplot(log2(counts(DESeq.ds) +1), notch=TRUE,
        main = "Non-normalized read counts",
        ylab ="log2(read counts)", cex = .6, las=2)
## bp of size-factor normalized values
boxplot(log2(counts(DESeq.ds, normalized=TRUE) +1), notch=TRUE,
        main = "Size-factor-normalized read counts",
        ylab ="log2(read counts)", cex = .6, las=2)

Understanding more properties of read count data

Make a scatterplot of log normalized counts against each other to see how well the actual values correlate which each other per sample and gene.

## non-normalized read counts plus pseudocount
assay(DESeq.ds, "log.counts") <- log2(counts(DESeq.ds, normalized = FALSE) + 1)
## normalized read counts
assay(DESeq.ds, "log.norm.counts") <- log2(counts(DESeq.ds, normalized=TRUE) + 1)
par(mfrow=c(1,2))
DESeq.ds[, c("SRR16684376_Control","SRR16684375_Control")] %>%
  assay(.,  "log.norm.counts") %>%
  plot(., cex=.1, main = "76_Control vs. 75_Control")
DESeq.ds[, c("SRR16684379_Control_4OHT","SRR16684378_Control_4OHT")] %>%
  assay(.,  "log.norm.counts") %>%
  plot(., cex=.1, main = "79_Control_4OHT vs 78_Control_4OHT")

The fanning out of the points in the lower left corner indicates that read counts correlate less well between replicates when they are low. the lower the mean read counts per gene, the higher the standard deviation.

Mean-SD Plot:

## generate the base meanSdPlot using sequencing depth normalized log2(read counts) ## set up plotting frame
par(mfrow=c(1,1))
## generate the plot
msd_plot <- vsn::meanSdPlot(assay(DESeq.ds, "log.norm.counts"), 
                            ranks=FALSE, # show the data on the original scale
                            plot = FALSE)
msd_plot$gg +
  ggtitle("Sequencing depth normalized log2(read counts)") +
  ylab("standard deviation")

meanSdPlot manual: The red line depicts the running median estimator. If there is no variance-mean dependence, then the line should be approximately horizontal.

The plot here shows that there is some variance-mean dependence for genes with low read counts. This means that the data shows signs of heteroskedasticity. Many tools expect data to be homoskedastic, i.e., all variables should have similar variances.

Reducing the dependence of the variance on the mean

rlog transformation:

DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)
## set blind = FALSE if the conditions are expected to introduce ## strong differences in a large proportion of the genes
## (blind means blind to the experimental design)
par(mfrow=c(1,2))
plot(assay(DESeq.ds, "log.norm.counts")[,1:2], cex=.1,
     main = "size factor and log2-transformed")
## the rlog-transformed counts are stored in the "assay" accessor
plot(assay(DESeq.rlog)[,1:2],
     cex=.1, main = "rlog transformed",
     xlab = colnames(assay(DESeq.rlog[,1])),
     ylab = colnames(assay(DESeq.rlog[,2])) )

Mean-SD Plot:

## rlog-transformed read counts
msd_plot <- vsn::meanSdPlot(assay(DESeq.rlog), ranks=FALSE, plot = FALSE)
msd_plot$gg + ggtitle("Following rlog transformation") +
  coord_cartesian(ylim = c(0,3))

Heatmap of the count matrix

select <- order(rowMeans(counts(DESeq.ds,normalized=TRUE)),
                decreasing=TRUE)[1:20]

cdata <- colData(DESeq.ds)

pheatmap(assay(DESeq.rlog)[select,], 
         cluster_rows=FALSE, 
         show_rownames=FALSE,
         cluster_cols=FALSE, 
         annotation_col=as.data.frame(cdata))

Heatmap of the sample-to-sample distances

sampleDists <- dist(t(assay(DESeq.rlog)))

sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(DESeq.rlog$condition)
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

PCA plot of the samples

plotPCA(DESeq.rlog, intgroup=c("condition"))

Differential Gene Expression Analysis

DESeq.ds
## class: DESeqDataSet 
## dim: 40239 11 
## metadata(1): version
## assays(3): counts log.counts log.norm.counts
## rownames(40239): ENSMUSG00000104385 ENSMUSG00000086053 ...
##   ENSMUSG00002074899 ENSMUSG00000095742
## rowData names(0):
## colnames(11): SRR16684376_Control SRR16684375_Control ...
##   SRR16684379_Control_4OHT SRR16684378_Control_4OHT
## colData names(2): condition sizeFactor
DESeq.ds$condition
##  [1] Control      Control      Control_4OHT Control      Control_4OHT
##  [6] Smg6_iKO     Smg6_iKO     Smg6_iKO     Smg6_iKO     Control_4OHT
## [11] Control_4OHT
## Levels: Control Control_4OHT Smg6_iKO
## the following uses the magrittr assignment pipe
DESeq.ds$condition %<>% relevel(ref="Control")
DESeq.ds$condition
##  [1] Control      Control      Control_4OHT Control      Control_4OHT
##  [6] Smg6_iKO     Smg6_iKO     Smg6_iKO     Smg6_iKO     Control_4OHT
## [11] Control_4OHT
## Levels: Control Control_4OHT Smg6_iKO
design(DESeq.ds)
## ~condition
DESeq.ds %<>% DESeq()
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# normalize for diffs in sequencing depth and abundance per sample
# gene-wise dispersion estimates across all samples
# fit a neg. binomial GLM and compute Wald stat for each gene

You can see that the DESeq object now has additional entries n the rowData slot:

DESeq.ds
## class: DESeqDataSet 
## dim: 40239 11 
## metadata(1): version
## assays(6): counts log.counts ... H cooks
## rownames(40239): ENSMUSG00000104385 ENSMUSG00000086053 ...
##   ENSMUSG00002074899 ENSMUSG00000095742
## rowData names(26): baseMean baseVar ... deviance maxCooks
## colnames(11): SRR16684376_Control SRR16684375_Control ...
##   SRR16684379_Control_4OHT SRR16684378_Control_4OHT
## colData names(2): condition sizeFactor
rowData(DESeq.ds) %>% colnames
##  [1] "baseMean"                                       
##  [2] "baseVar"                                        
##  [3] "allZero"                                        
##  [4] "dispGeneEst"                                    
##  [5] "dispGeneIter"                                   
##  [6] "dispFit"                                        
##  [7] "dispersion"                                     
##  [8] "dispIter"                                       
##  [9] "dispOutlier"                                    
## [10] "dispMAP"                                        
## [11] "Intercept"                                      
## [12] "condition_Control_4OHT_vs_Control"              
## [13] "condition_Smg6_iKO_vs_Control"                  
## [14] "SE_Intercept"                                   
## [15] "SE_condition_Control_4OHT_vs_Control"           
## [16] "SE_condition_Smg6_iKO_vs_Control"               
## [17] "WaldStatistic_Intercept"                        
## [18] "WaldStatistic_condition_Control_4OHT_vs_Control"
## [19] "WaldStatistic_condition_Smg6_iKO_vs_Control"    
## [20] "WaldPvalue_Intercept"                           
## [21] "WaldPvalue_condition_Control_4OHT_vs_Control"   
## [22] "WaldPvalue_condition_Smg6_iKO_vs_Control"       
## [23] "betaConv"                                       
## [24] "betaIter"                                       
## [25] "deviance"                                       
## [26] "maxCooks"

save DESeq object

save(DESeq.ds, file="DESeq.ds.Rdata")
save(DESeq.rlog, file="DESeq.rlog.Rdata")

Adjusting for multiple hypothesis testing with independent filtering

rowData(DESeq.ds)$WaldPvalue_condition_Smg6_iKO_vs_Control %>%
    hist(breaks=19, main="Raw p-values for Smg6-iKO vs Control")

rowData(DESeq.ds)$WaldPvalue_condition_Control_4OHT_vs_Control %>%
    hist(breaks=19, main="Raw p-values for Control-4OHT vs Control")

By default, will compare Smg6_iko vs control

DGE.results <- results(DESeq.ds, independentFiltering = TRUE, alpha = 0.05)
# the first line will tell you which comparison was done to achieve the log2FC

pairwise comparison between three conditions

res.ctl.smg6 <- results(DESeq.ds, contrast=c("condition","Smg6_iKO", "Control"), independentFiltering = TRUE, alpha = 0.05)
res.ctl.ct4oht <- results(DESeq.ds, contrast=c("condition","Control_4OHT", "Control"))
res.ctl4oht.smg6 <- results(DESeq.ds, contrast=c("condition","Smg6_iKO", "Control_4OHT"))
# the first line will tell you which comparison was done to achieve the log2FC 
head(res.ctl.smg6)
## log2 fold change (MLE): condition Smg6_iKO vs Control 
## Wald test p-value: condition Smg6 iKO vs Control 
## DataFrame with 6 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE      stat    pvalue
##                    <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000104385 0.9620537       0.533272  1.648386  0.323511  0.746308
## ENSMUSG00000086053 0.0831445      -1.039083  3.972769 -0.261551  0.793667
## ENSMUSG00000102135 7.1935060      -0.299940  0.653222 -0.459169  0.646112
## ENSMUSG00000101097 0.0977052       0.000000  3.972769  0.000000  1.000000
## ENSMUSG00000100764 5.1448527       0.883793  0.705590  1.252559  0.210366
## ENSMUSG00000102534 0.4503458      -1.256679  2.241398 -0.560668  0.575024
##                         padj
##                    <numeric>
## ENSMUSG00000104385        NA
## ENSMUSG00000086053        NA
## ENSMUSG00000102135        NA
## ENSMUSG00000101097        NA
## ENSMUSG00000100764        NA
## ENSMUSG00000102534        NA
summary(res.ctl.smg6)
## 
## out of 40239 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 568, 1.4%
## LFC < 0 (down)     : 1025, 2.5%
## outliers [1]       : 5, 0.012%
## low counts [2]     : 25743, 64%
## (mean count < 38)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# the DESeqResult object can basically be handled like a data.frame
table(res.ctl.smg6$padj < 0.05)
## 
## FALSE  TRUE 
## 12898  1593

Compare SMG6 gene count across groups

AnnotationDbi::select(mm, keys="Smg6",keytype="SYMBOL", columns="ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
plotCounts(DESeq.ds, gene="ENSMUSG00000038290", normalized=TRUE, xlab="", main="Smg6")

Assessing the DE results

Visual assessments using normalized read counts

sort DEG.results by padj
res.ctl.smg6.sorted <- res.ctl.smg6 %>% `[`(order(.$padj),)
res.ctl4oht.smg6.sorted <- res.ctl4oht.smg6 %>% `[`(order(.$padj),)
res.ctl.ct4oht.sorted <- res.ctl.ct4oht %>% `[`(order(.$padj),)

# save these three objects 
save(res.ctl.smg6.sorted, res.ctl4oht.smg6.sorted, res.ctl.ct4oht.sorted, file="DEG.results.sorted.Rdata")

write.table(res.ctl.smg6.sorted, "DEG_results_Control_vs_SMG6iKO.csv", sep=",", quote=FALSE, row.names=TRUE)
write.table(res.ctl4oht.smg6.sorted, "DEG_results_Control4OHT_vs_SMG6iKO.csv", sep=",", quote=FALSE, row.names=TRUE)
write.table(res.ctl.ct4oht.sorted, "DEG_results_Control_vs_Control4OHT.csv", sep=",", quote=FALSE, row.names=TRUE)
define the pair that will be used for subsequent analysis
DEG.res.sorted = res.ctl.smg6.sorted
adj p-values histogram
DEG.res.sorted$padj %>%
    hist(breaks=19, main="Adjusted p-values for Smg6 iKO vs Control")

heatmaps
# identify genes with the desired adjusted p-value cut-off
DGEgenes <- rownames(subset(DEG.res.sorted, padj < 0.05)) # extract rlog-transformed values into a matrix
rlog.dge <- DESeq.rlog[DGEgenes,] %>% assay

# heatmap of DEG sorted by p.adjust
pheatmap(rlog.dge, 
         scale="row",
         show_rownames=FALSE, 
         main="DGE Control-4OHT vs SMG-iKO (row-based z-score)")

MA plots
plotMA(DEG.res.sorted, alpha=0.05,
       main="Test: p.adj.value < 0.05", ylim = c(-4,4))

Volcano plots (with gene names)

Annotate with gene names:

annot.DGE <- AnnotationDbi::select(mm, keys=rownames(DEG.res.sorted),
                    keytype="ENSEMBL", columns="SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
DEG.res.sorted$ENSEMBL <- rownames(DEG.res.sorted)
DEG.res.sorted.df <- as.data.frame(DEG.res.sorted)
DEG.res.sorted.df <- merge(x = DEG.res.sorted.df, y = annot.DGE, by= "ENSEMBL", all.x=TRUE)
head(DEG.res.sorted.df)

Volcano plot:

vp1 <- EnhancedVolcano(DEG.res.sorted.df,
                      lab=DEG.res.sorted.df$SYMBOL,
                      x='log2FoldChange', y='padj',
                      pCutoff=0.05,
                      title="Control-4OHT VS. SMG6-iKO")
print(vp1)

DEG.res.sorted.df.sig <- subset(DEG.res.sorted.df, (abs(DEG.res.sorted.df$log2FoldChange) > 0.2 & DEG.res.sorted.df$padj < 0.05)) %>% `[`(order(.$padj),)

head(DEG.res.sorted.df.sig)

Pathway enrichment

Setup

library(enrichR)
dbs <- listEnrichrDbs()
library(clusterProfiler)
library(cowplot)

NMD pathway

gmt file is downlaoded from GSEA mouse gene set NMD

gmt_df <- read.gmt("GOBP_NUCLEAR_TRANSCRIBED_MRNA_CATABOLIC_PROCESS_NONSENSE_MEDIATED_DECAY.v2023.1.Mm.gmt")
## Warning in readLines(gmtfile): incomplete final line found on
## 'GOBP_NUCLEAR_TRANSCRIBED_MRNA_CATABOLIC_PROCESS_NONSENSE_MEDIATED_DECAY.v2023.1.Mm.gmt'
annot.gmt <- AnnotationDbi::select(mm, keys=gmt_df$gene, keytype="SYMBOL", columns="ENSEMBL", multiVals="first")
## 'select()' returned 1:1 mapping between keys and columns
head(annot.gmt)
annot.gmt <- annot.gmt[!duplicated(annot.gmt$SYMBOL), ]
gmt_df$gene <- annot.gmt$ENSEMBL

# rank genes by log2FoldChange 
gene_list = DEG.res.sorted.df$log2FoldChange
names(gene_list) <- DEG.res.sorted.df$ENSEMBL
gene_list<- sort(gene_list, decreasing = TRUE)


gse.nmdset <- GSEA(gene_list, TERM2GENE = gmt_df, pvalueCutoff = 1)
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (19.15% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize,
## gseaParam, : There are duplicate gene names, fgsea may produce unexpected
## results.
## leading edge analysis...
## done...
gse.nmdset # check the result 
## #
## # Gene Set Enrichment Analysis
## #
## #...@organism     UNKNOWN 
## #...@setType      UNKNOWN 
## #...@geneList     Named num [1:40346] 4.47 4.44 4.18 4.12 3.96 ...
##  - attr(*, "names")= chr [1:40346] "ENSMUSG00000029371" "ENSMUSG00000082534" "ENSMUSG00000112424" "ENSMUSG00000040969" ...
## #...nPerm     
## #...pvalues adjusted by 'BH' with cutoff <1 
## #...1 enriched terms found
## 'data.frame':    1 obs. of  11 variables:
##  $ ID             : chr "GOBP_NUCLEAR_TRANSCRIBED_MRNA_CATABOLIC_PROCESS_NONSENSE_MEDIATED_DECAY"
##  $ Description    : chr "GOBP_NUCLEAR_TRANSCRIBED_MRNA_CATABOLIC_PROCESS_NONSENSE_MEDIATED_DECAY"
##  $ setSize        : int 42
##  $ enrichmentScore: num -0.253
##  $ NES            : num -0.812
##  $ pvalue         : num 0.823
##  $ p.adjust       : num 0.823
##  $ qvalue         : logi NA
##  $ rank           : num 30147
##  $ leading_edge   : chr "tags=100%, list=75%, signal=25%"
##  $ core_enrichment: chr "ENSMUSG00000001415/ENSMUSG00000094973/ENSMUSG00000073460/ENSMUSG00000040613/ENSMUSG00000002210/ENSMUSG000000583"| __truncated__
## #...Citation
##  T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
##  clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
##  The Innovation. 2021, 2(3):100141
gseaplot(gse.nmdset, geneSetID = 1, by = "runningScore", title="NMD pathway")

KEGG pathway enrichment

sig_DEGs <- subset(DEG.res.sorted.df.sig, padj < 0.05, abs(log2FoldChange)>0.2)
sig_upReg_DEGs <- subset(sig_DEGs, log2FoldChange > 0)
sig_downReg_DEGs <- subset(sig_DEGs, log2FoldChange < 0)
sig_DEGs <- enrichr(as.vector(na.omit(sig_upReg_DEGs$SYMBOL)), "KEGG_2019_Mouse")
## Uploading data to Enrichr... Done.
##   Querying KEGG_2019_Mouse... Done.
## Parsing results... Done.
upReg_pathways <- enrichr(as.vector(na.omit(sig_upReg_DEGs$SYMBOL)), "KEGG_2019_Mouse")
## Uploading data to Enrichr... Done.
##   Querying KEGG_2019_Mouse... Done.
## Parsing results... Done.
downReg_pathways <- enrichr(as.vector(na.omit(sig_downReg_DEGs$SYMBOL)), "KEGG_2019_Mouse")
## Uploading data to Enrichr... Done.
##   Querying KEGG_2019_Mouse... Done.
## Parsing results... Done.

Plot pathways

sigDEG_plot <- plotEnrich(sig_DEGs[[1]], showTerms = 15, numChar = 40, y = "Count", orderBy = "Adjusted.P.value", title = "enriched pathways, padj < 0.05, logFoldChange>0.2")
sigDEG_plot

upReg_plot <- plotEnrich(upReg_pathways[[1]], showTerms = 15, numChar = 40, y = "Count", orderBy = "Adjusted.P.value", title = "Upregulated pathways")

downReg_plot <- plotEnrich(downReg_pathways[[1]], showTerms = 15, numChar = 40, y = "Count", orderBy = "Adjusted.P.value")
combined_plot <- plot_grid(upReg_plot, downReg_plot, labels = c('A', 'B'))
upReg_plot

downReg_plot

ggsave(
  "Pathway_enrichment.png",
  combined_plot ,
  width = 12,
  height = 5,
  units = "in",
  dpi = 300
)

Authors’ feature counts for SMG6:

cd GSE186964_RAW/
grep "ENSMUSG00000038290" *.txt | cut -f 7
2003
2483
1942
1787
1879
2238
1885
2144
1698
2000
2133

Download bam files:

cd /Users/jennie2000/Desktop/5004ANGSD/angsd_project/bam_files
bam="/home/xil4009/scratch_angsd/SMG6_RNA_SEQ/alignment_star/SRR16684385.Aligned.sortedByCoord.out.bam"
bai="/home/xil4009/scratch_angsd/SMG6_RNA_SEQ/alignment_star/SRR16684385.Aligned.sortedByCoord.out.bam.bai"

scp -r xil4009@aphrodite.med.cornell.edu:${bam or bai} .
scp -r xil4009@aphrodite.med.cornell.edu:${bai} .

scp -r xil4009@aphrodite.med.cornell.edu:/home/xil4009/scratch_angsd/SMG6_RNA_SEQ/alignment_star/*.{bam,bai} .